Quantization Fundamentals

Optimization
Author

Ritesh Kumar Maurya

Published

May 13, 2024

Handling Big Models

Current model compression techniques:

  1. Pruning: remove connections that do not improve the model.

  2. Knowledge Distillation: train a smaller model (student) using the original model (teacher). Con: you need enough hardware to fit both the teacher and the student.

Options to quantize:

Fig.1 - A layer of Neural Network.
  1. You can quantize the weights.

  2. You can quantize the activations that propagate through the layers of the neural network.

Idea: Store the parameters of the model in lower precision.

Data Types and Sizes

Integer

  1. Unsigned Integer (8-bit): range is [0, 255], i.e. [0, 2^n - 1] (all 8 bits are used to represent the number).

  2. Signed Integer (8-bit): range is [-128, 127], i.e. [-2^(n-1), 2^(n-1) - 1] (7 bits are used to represent the number and the 8th bit represents the sign; 0: positive, 1: negative).

| Data Type | torch.dtype | torch.dtype alias |
|---|---|---|
| 8-bit signed integer | torch.int8 | |
| 8-bit unsigned integer | torch.uint8 | |
| 16-bit signed integer | torch.int16 | torch.short |
| 32-bit signed integer | torch.int32 | torch.int |
| 64-bit signed integer | torch.int64 | torch.long |
  • You can use the code below to inspect the range of an integer dtype:
import torch
print(torch.iinfo(torch.int8))
# Output: iinfo(min=-128, max=127, dtype=int8)
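  • The same check works for the unsigned type (continuing from the snippet above, which already imported torch):
print(torch.iinfo(torch.uint8))
# Output: iinfo(min=0, max=255, dtype=uint8)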

Floating Point

There are 3 components in a floating point number:

  • Sign: positive/negative (always 1 bit)
  • Exponent (range): impacts the representable range of the number
  • Fraction (precision): impacts the precision of the number

  • FP32, BF16, FP16, and FP8 are floating point formats with a specific number of bits for the exponent and the fraction.
  1. FP32
  • Sign: 1 bit
  • Exponent (range): 8 bits
  • Fraction (precision): 23 bits
  • Total: 32 bits
  2. BF16
  • Sign: 1 bit
  • Exponent (range): 8 bits
  • Fraction (precision): 7 bits
  • Total: 16 bits
  3. FP16
  • Sign: 1 bit
  • Exponent (range): 5 bits
  • Fraction (precision): 10 bits
  • Total: 16 bits

Comparison Of Data Types

| Data Type | Precision | Maximum value |
|---|---|---|
| FP32 | Best | ~10^38 |
| FP16 | Better | ~10^4 |
| BF16 | Good | ~10^38 |


| Data Type | torch.dtype | torch.dtype alias |
|---|---|---|
| 16-bit floating point | torch.float16 | torch.half |
| 16-bit brain floating point | torch.bfloat16 | |
| 32-bit floating point | torch.float32 | torch.float |
| 64-bit floating point | torch.float64 | torch.double |
import torch
print("By default, python stores float data in fp64")
value = 1/3
tensor_fp64 = torch.tensor(value, dtype = torch.float64)
tensor_fp32 = torch.tensor(value, dtype = torch.float32)
tensor_fp16 = torch.tensor(value, dtype = torch.float16)
tensor_bf16 = torch.tensor(value, dtype = torch.bfloat16)

print(f"fp64 tensor: {format(tensor_fp64.item(), '.60f')}")
print(f"fp32 tensor: {format(tensor_fp32.item(), '.60f')}")
print(f"fp16 tensor: {format(tensor_fp16.item(), '.60f')}")
print(f"bf16 tensor: {format(tensor_bf16.item(), '.60f')}")

print(torch.finfo(torch.bfloat16))
Output:

By default, python stores float data in fp64
fp64 tensor: 0.333333333333333314829616256247390992939472198486328125000000
fp32 tensor: 0.333333343267440795898437500000000000000000000000000000000000
fp16 tensor: 0.333251953125000000000000000000000000000000000000000000000000
bf16 tensor: 0.333984375000000000000000000000000000000000000000000000000000
finfo(resolution=0.01, min=-3.38953e+38, max=3.38953e+38, eps=0.0078125, smallest_normal=1.17549e-38, tiny=1.17549e-38, dtype=bfloat16)
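  • The maxima in the comparison table above can be verified with torch.finfo as well; a quick sketch:
import torch

# Print range/precision info for the floating point dtypes compared above
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):16} max={info.max:.3e}  eps={info.eps:.3e}")

# FP32 and BF16 share an 8-bit exponent, hence the similar ~1e38 maximum;
# FP16's 5-bit exponent caps its maximum at 65504, but its 10-bit fraction
# gives it a smaller eps (finer precision) than BF16.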

PyTorch Downcasting

  • When a higher precision data type is converted to a lower precision data type, some information is lost (see the small demo after this list).

  • Advantages:

    • Reduced memory footprint
    • Increased compute speed (depends on the hardware)
  • Disadvantages:

    • Less precise
  • Use case:

    • Mixed precision training
      • Do the computation in smaller precision (FP16/BF16/FP8)
      • Store and update the weights in higher precision (FP32)
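  • A minimal sketch of downcasting and the resulting loss of information (the tensor here is just a random example):
import torch

# Create a random tensor in the default FP32 precision
tensor_fp32 = torch.rand(1000, dtype=torch.float32)

# Downcast it to BF16: same exponent range, far fewer fraction bits
tensor_bf16 = tensor_fp32.to(dtype=torch.bfloat16)

# The aggregate statistics drift slightly because every value lost precision
print(f"fp32 mean: {tensor_fp32.mean().item()}")
print(f"bf16 mean: {tensor_bf16.mean().item()}")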

Loading Models by data type

  • target_dtype = torch.float16 or torch.bfloat16

  • model = model.to(target_dtype)

  • model = model.half() for fp16

  • model = model.bfloat16() for bfloat16

  • Prefer bfloat16 over float16 when using PyTorch on CPU.

  • FP32 is the default dtype in PyTorch.

  • model.get_memory_footprint()/1e+6 gives the model size in MB (for Hugging Face transformers models).

  • torch.set_default_dtype(desired_dtype) # by doing so we can load the model directly in the desired dtype, without loading it in full precision first and then casting it

  • Set the default dtype back to float32 afterwards to avoid unexpected behavior.
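  • A sketch of loading a Hugging Face model directly in BF16 (the model name here is only an example; any transformers model works the same way):
import torch
from transformers import AutoModel

# Load the model with its weights cast to bfloat16 instead of the default FP32
model = AutoModel.from_pretrained("google/flan-t5-small", torch_dtype=torch.bfloat16)

# Memory footprint in MB; roughly half of what the FP32 version would need
print(model.get_memory_footprint() / 1e+6, "MB")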

Quantization Theory

  • Quantization refers to the process of mapping a large set to a smaller set of values.

  • How do we convert the FP32 weights to INT8 without losing too much information?

    • It is done using a linear mapping, defined by two parameters:
    • s = scale
    • z = zero point
  • How do we get back our original tensor from the quantized tensor?

    • We can't recover the original tensor exactly, but we can dequantize using the same linear relationship that was used to quantize it; the difference is the quantization error.
Fig.2 - Comparison of tensors.

Quantize Using Quanto Library

  • from quanto import quantize, freeze
  • quantize(model, weights = desired_dtype, activations = desired_dtype)
  • freeze(model)
  • quantize creates an intermediate state of the model
  • after calling freeze, we get the quantized weights
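  • A minimal sketch of this workflow, assuming the quanto package used in the course is installed (the tiny model and the int8 target are just illustrative choices):
import torch
from quanto import quantize, freeze

# A tiny stand-in model; in practice this would be a loaded transformers model
model = torch.nn.Sequential(torch.nn.Linear(128, 128), torch.nn.ReLU(), torch.nn.Linear(128, 10))

# Replace the model's linear layers with quantization-aware versions (intermediate state)
quantize(model, weights=torch.int8, activations=None)

# (Optional) run calibration or quantization-aware training here,
# while the model is still in its intermediate state

# freeze converts the weights to their quantized int8 representation
freeze(model)
print(model)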

Uses of the Intermediate State

  • Calibration
    • Calibrate the model when quantizing the activations of the model.
      • Range of activation values depends on what input was given. (e.g. different input text will generate different activations)
      • Min/Max of activation ranges are used to perform linear quantization.
      • How to get min and max range of activations?
        • Gather sample input data.
        • Run inference.
        • Calculate min/max of activations
    • Result: better quantized activations
  • Quantization Aware Training
    • Training in a way that controls how the model performs once it is quantized.
    • Intermediate state holds both the quantized and the unquantized weights.
    • Use the quantized version of the model in the forward pass (e.g. BF16)
    • Update the original, unquantized version of the model weights during backpropagation (e.g. FP32)
  • In L4 there is a function in helper.py to calculate the model size (a rough equivalent is sketched below)
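  • A rough, hypothetical re-implementation of such a helper (not the course's helper.py, just a sketch that sums parameter and buffer sizes):
import torch

def compute_model_size_mb(model: torch.nn.Module) -> float:
    """Approximate in-memory size of a model's parameters and buffers, in MB."""
    size_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
    size_bytes += sum(b.numel() * b.element_size() for b in model.buffers())
    return size_bytes / 1e+6

# Example: a small linear layer stored in FP32 vs FP16
layer = torch.nn.Linear(1024, 1024)
print(compute_model_size_mb(layer))         # ~4.2 MB in FP32
print(compute_model_size_mb(layer.half()))  # ~2.1 MB in FP16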

Linear Quantization

Even though it looks very simple, it is used in many SOTA quantization methods:

  • AWQ: Activation-aware Weight Quantization
  • GPTQ: GPT Quantized
  • BNB: BitsandBytes Quantization
Fig.3 - Range
  • Simple idea: linear mapping
  • r = s * (q - z)
    • where
    • r: original value (e.g. in FP32)
    • s: scale (e.g. in FP32)
    • q: quantized value (e.g. in INT8)
    • z: zero point (e.g. in INT8)
  • How do we get the scale and the zero point?
    • s = (r_max - r_min) / (q_max - q_min)
    • z = int(round(q_min - r_min / s))
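  • A self-contained sketch of these formulas for per-tensor INT8 quantization (function names like linear_quantize are my own, not from a library; the degenerate case of a constant tensor is ignored):
import torch

def linear_quantize(r: torch.Tensor, dtype=torch.int8):
    # q_min/q_max come from the target integer type, e.g. [-128, 127] for int8
    q_min, q_max = torch.iinfo(dtype).min, torch.iinfo(dtype).max
    r_min, r_max = r.min().item(), r.max().item()

    # s = (r_max - r_min) / (q_max - q_min),  z = int(round(q_min - r_min / s))
    s = (r_max - r_min) / (q_max - q_min)
    z = int(round(q_min - r_min / s))

    # Round to the nearest integer and clamp so every value fits the int8 range
    q = torch.clamp(torch.round(r / s + z), q_min, q_max).to(dtype)
    return q, s, z

def linear_dequantize(q: torch.Tensor, s: float, z: int) -> torch.Tensor:
    # r ≈ s * (q - z): we only recover an approximation of the original tensor
    return s * (q.to(torch.float32) - z)

r = torch.randn(4, 4)
q, s, z = linear_quantize(r)
r_hat = linear_dequantize(q, s, z)
print("max quantization error:", (r - r_hat).abs().max().item())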

Quantization of LLMs

Recent SOTA quantization methods:

  • LLM.INT8 (only 8-bit)
    • Decomposes the matmul into two stages (the outlier part in float16 and the non-outlier part in int8).
  • QLoRA (only 4-bit)
    • Quantizes the base model and fine-tunes adapters on top of it
  • AWQ
  • GPTQ
  • SmoothQuant

More recent SOTA quantization methods for 2-bit quantization

  • QuIP#
  • HQQ
  • AQLM

All are open-source

Some Quantization Methods require calibration (as described above)

Some Quantization Methods require Adjustments

Many of these methods were originally applied to LLMs, but they can be applied to other types of models by making a few adjustments to the quantization method.

  • Some methods can be applied without making adjustments

    • Linear quantization
    • LLM.INT8
    • QLoRA
    • HQQ
  • Other approaches are data-dependent

  • There are users on Hugging Face who distribute quantized versions of popular models (e.g. TheBloke).

  • Check out the Hugging Face Open LLM Leaderboard to see how these quantized models perform.

  • Benefits of fine-tuning a quantized model:

    • Recover the accuracy from quantization
    • Tailor your model for specific use-cases and applications
  • Fine-tune with Quantization Aware Training (QAT)

    • Not compatible with Post Training Quantization (PTQ) techniques.
    • The linear quantization method is an example of PTQ.
    • PEFT + QLoRA (a sketch follows this list)
      • QLoRA quantizes the pre-trained base weights to 4-bit precision.
      • This matches the precision of the LoRA adapter weights.
      • This allows the model to add the activations of the pre-trained weights and the adapter weights.
      • This sum of the two activations can be fed as the input to the next layer of the network.
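  • A hedged sketch of this setup with the transformers, bitsandbytes and peft libraries; the model name, LoRA rank and target module names are placeholders chosen for illustration:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the base model with its weights quantized to 4-bit (NF4) via bitsandbytes
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # precision used when dequantizing for compute
)
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", quantization_config=bnb_config)

# Attach small trainable LoRA adapters; only these are updated during fine-tuning
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # placeholder module names; depend on the architecture
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()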